Multimodal Large Models#
Learning Resources#
- [2025.05] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
- [2025.05] Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
- [2025.05] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning Xiaohongshu. Image-grounded CoT that zooms into image regions.
- [2024.08] Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Discern Causal Links Across Modalities
OCR#
- [2025.10] Glyph: Scaling Context Windows via Visual-Text Compression Zhipu AI
- [2025.10] DeepSeek-OCR: Contexts Optical Compression
- [2024.09] General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model Megvii. End-to-end encoder-decoder architecture.
Open-Source Projects#
- [2025.04] Kimi-VL Technical Report
- [2025.01] MiniMax-01: Scaling Foundation Models with Lightning Attention
- [2024.03] VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks Introducing RoPE-2D helps model performance, especially with variable-resolution inputs.
- minimind-v A minimal VLM implementation
- LLaVA
- [2023.10] Improved Baselines with Visual Instruction Tuning
- [2023.04] Visual Instruction Tuning
Qwen#
- [2025.11] Qwen3-VL Technical Report
- [2025.02] Qwen2.5-VL Technical Report
- [2023.08] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
DeepSeek#
Core Modules#
Encoder-Decoder#
Positional Encoding#
- [2024.09] "Building a Cart Behind Closed Doors": Thoughts on Multimodal Approaches (Part 3): Positional Encoding
- [2024.03] RoPE-Tie Transformer Upgrade Path 17: Simple Thoughts on Multimodal Positional Encoding
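The posts above discuss extending RoPE from 1-D token positions to 2-D image-patch positions. A minimal NumPy sketch of one common formulation (not necessarily the exact scheme in any of the cited works): split the head channels into a row half and a column half, and rotate each with standard 1-D RoPE. The attention score between two rotated vectors then depends only on the relative offset (Δrow, Δcol). Function names are my own.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Apply standard 1-D RoPE to the last dim of x at integer position pos."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)       # (d/2,) geometric frequencies
    angles = pos * freqs                            # rotation angle per channel pair
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # pairwise rotation: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, row, col):
    """2-D RoPE sketch: rotate the first half of the channels by the patch's
    row index and the second half by its column index."""
    d = x.shape[-1]
    assert d % 4 == 0, "need d divisible by 4 for two pairwise rotations"
    h = d // 2
    return np.concatenate([rope_1d(x[..., :h], row),
                           rope_1d(x[..., h:], col)], axis=-1)

# The score q·k is invariant under a shared shift of both patch positions:
rng = np.random.default_rng(0)
q, k = rng.standard_normal(16), rng.standard_normal(16)
s1 = rope_2d(q, 3, 5) @ rope_2d(k, 1, 2)   # offsets (Δrow, Δcol) = (2, 3)
s2 = rope_2d(q, 7, 9) @ rope_2d(k, 5, 6)   # same offsets (2, 3)
assert abs(s1 - s2) < 1e-8
```

Because each half is a pure rotation, the encoding also preserves vector norms, so it only injects position into the query-key interaction, not into token magnitudes.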